
TathyaNyaya and FactLegalLlama: Advancing Factual Judgment Prediction and Explanation in the Indian Legal Context

Nigam, Shubham Kumar, Patnaik, Balaramamahanthi Deepak, Mishra, Shivam, Shallum, Noel, Ghosh, Kripabandhu, Bhattacharya, Arnab

arXiv.org Artificial Intelligence

In the landscape of Fact-based Judgment Prediction and Explanation (FJPE), reliance on factual data is essential for developing robust and realistic AI-driven decision-making tools. This paper introduces TathyaNyaya, the largest annotated dataset for FJPE tailored to the Indian legal context, encompassing judgments from the Supreme Court of India and various High Courts. Derived from the Hindi terms "Tathya" (fact) and "Nyaya" (justice), the TathyaNyaya dataset is uniquely designed to focus on factual statements rather than complete legal texts, reflecting real-world judicial processes where factual data drives outcomes. Complementing this dataset, we present FactLegalLlama, an instruction-tuned variant of the LLaMa-3-8B Large Language Model (LLM), optimized for generating high-quality explanations in FJPE tasks. Finetuned on the factual data in TathyaNyaya, FactLegalLlama integrates predictive accuracy with coherent, contextually relevant explanations, addressing the critical need for transparency and interpretability in AI-assisted legal systems. Our methodology combines transformers for binary judgment prediction with FactLegalLlama for explanation generation, creating a robust framework for advancing FJPE in the Indian legal domain. TathyaNyaya not only surpasses existing datasets in scale and diversity but also establishes a benchmark for building explainable AI systems in legal analysis. The findings underscore the importance of factual precision and domain-specific tuning in enhancing predictive performance and interpretability, positioning TathyaNyaya and FactLegalLlama as foundational resources for AI-assisted legal decision-making.
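The two-stage design described above (a transformer for the binary judgment, a fine-tuned LLM for the explanation) can be sketched as follows. Both models are replaced here by trivial stand-ins: the cue words and the explanation template are invented for illustration and are not the actual transformer or FactLegalLlama behavior.

```python
# Sketch of the two-stage FJPE pipeline: stage 1 predicts a binary
# outcome from the factual statements, stage 2 generates an
# explanation conditioned on that outcome.

def predict_outcome(facts: str) -> int:
    """Stand-in for the transformer-based binary judgment predictor."""
    accept_cues = ("allowed", "granted", "upheld")  # hypothetical cues
    return int(any(cue in facts.lower() for cue in accept_cues))

def generate_explanation(facts: str, outcome: int) -> str:
    """Stand-in for FactLegalLlama's explanation generation."""
    label = "accepted" if outcome == 1 else "rejected"
    return f"On the stated facts, the claim is {label}."

facts = "The tribunal found the procedure valid and the appeal was allowed."
print(generate_explanation(facts, predict_outcome(facts)))
```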


GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations

Chlapanis, Odysseas S., Galanis, Dimitrios, Aletras, Nikolaos, Androutsopoulos, Ion

arXiv.org Artificial Intelligence

We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve their alignment. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
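A three-dimensional judge score has to be collapsed into one number somewhere; a toy aggregation might look like the following. The dimension names and the equal-weight mean are assumptions for illustration, not the benchmark's actual rubric.

```python
# Toy aggregation of per-dimension LLM-judge scores into one grade.

def aggregate_score(scores: dict) -> float:
    dims = ("facts", "statutes", "conclusion")  # hypothetical dimensions
    missing = [d for d in dims if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return sum(scores[d] for d in dims) / len(dims)

print(aggregate_score({"facts": 8.0, "statutes": 6.0, "conclusion": 7.0}))  # 7.0
```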


Judging by Appearances? Auditing and Intervening Vision-Language Models for Bail Prediction

Basu, Sagnik, Prakash, Shubham, Barge, Ashish Maruti, Jaiswal, Siddharth D, Dash, Abhisek, Ghosh, Saptarshi, Mukherjee, Animesh

arXiv.org Artificial Intelligence

Large language models (LLMs) have been extensively used for legal judgment prediction tasks based on case reports and crime history. However, with a surge in the availability of large vision language models (VLMs), legal judgment prediction systems can now be made to leverage the images of the criminals in addition to the textual case reports/crime history. Applications built in this way could lead to inadvertent consequences and be used with malicious intent. In this work, we run an audit to investigate the efficiency of standalone VLMs in the bail decision prediction task. We observe that the performance is poor across multiple intersectional groups and models wrongly deny bail to deserving individuals with very high confidence. We design different intervention algorithms by first including legal precedents through a RAG pipeline and then fine-tuning the VLMs using innovative schemes. We demonstrate that these interventions substantially improve the performance of bail prediction. Our work paves the way for the design of smarter interventions on VLMs in the future, before they can be deployed for real-world legal judgment prediction.
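The precedent-retrieval intervention can be sketched as ranking past cases by similarity to the current report and prepending the top-k to the model prompt. The word-overlap retriever and the prompt format below are stand-ins for whatever retriever and template the actual RAG pipeline uses.

```python
# Minimal precedent-RAG sketch: rank the corpus by word overlap with
# the report, prepend the top-k hits to the prompt.

def retrieve_precedents(report, corpus, k=2):
    words = set(report.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(report, corpus):
    context = "\n".join(f"Precedent: {p}"
                        for p in retrieve_precedents(report, corpus))
    return f"{context}\nCase: {report}\nShould bail be granted?"

corpus = ["theft case, bail granted on surety",
          "assault case, bail denied",
          "theft of vehicle, bail granted"]
print(build_prompt("theft of a vehicle by a first-time offender", corpus))
```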


ALARB: An Arabic Legal Argument Reasoning Benchmark

Shairah, Harethah Abu, AlHarbi, Somayah, AlHussein, Abdulaziz, Alsabea, Sameer, Shaqaqi, Omar, AlShamlan, Hebah, Knio, Omar, Turkiyyah, George

arXiv.org Artificial Intelligence

We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset's utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.
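One plausible way to frame the verdict-prediction task as an instruction-tuning example is to serialize the case fields into a prompt/output pair. The field names and prompt wording below are illustrative; they are not the dataset's actual schema.

```python
# Hypothetical serializer from a case record to an instruction-tuning
# example for verdict prediction.

def to_instruction_example(case):
    prompt = (f"Facts: {case['facts']}\n"
              f"Cited clauses: {'; '.join(case['clauses'])}\n"
              "Predict the court's verdict.")
    return {"instruction": prompt, "output": case["verdict"]}

example = to_instruction_example({
    "facts": "The supplier delivered the goods 40 days late.",
    "clauses": ["Commercial Law art. 12"],
    "verdict": "Compensation awarded to the buyer.",
})
print(example["instruction"])
```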


GLARE: Agentic Reasoning for Legal Judgment Prediction

Yang, Xinyu, Deng, Chenlong, Dou, Zhicheng

arXiv.org Artificial Intelligence

Legal judgment prediction (LJP) has become increasingly important in the legal field. In this paper, we identify that existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. Therefore, we introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules, thereby improving the breadth and depth of reasoning. Experiments conducted on the real-world dataset verify the effectiveness of our method. Furthermore, the reasoning chain generated during the analysis process can increase interpretability and provide the possibility for practical applications.
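The agentic pattern of invoking different knowledge modules can be made concrete with a small loop: keep calling modules until none contributes new knowledge, then predict. The modules, the stopping rule, and the final heuristic below are invented for illustration and do not reproduce GLARE's actual components.

```python
# Schematic agentic reasoning loop: modules add knowledge to a shared
# memory; the loop stops when a full pass yields nothing new.

def statute_module(case, memory):
    if "theft" in case and "statute" not in memory:
        return {"statute": "hypothetical penal code art. 264"}
    return {}

def precedent_module(case, memory):
    if "statute" in memory and "precedent" not in memory:
        return {"precedent": "a similar prior theft conviction"}
    return {}

def agentic_predict(case, modules):
    memory = {}
    while True:
        updates = {}
        for module in modules:
            updates.update(module(case, memory))
        if not updates:            # no module has new knowledge: stop
            break
        memory.update(updates)
    verdict = "guilty" if "precedent" in memory else "insufficient knowledge"
    return verdict, memory

verdict, memory = agentic_predict("a theft of livestock",
                                  [statute_module, precedent_module])
print(verdict, memory)
```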


Event Grounded Criminal Court View Generation with Cooperative (Large) Language Models

Yue, Linan, Liu, Qi, Zhao, Lili, Wang, Li, Gao, Weibo, An, Yanqing

arXiv.org Artificial Intelligence

With the development of legal intelligence, Criminal Court View Generation has attracted much attention as a crucial task of legal intelligence, which aims to generate concise and coherent texts that summarize case facts and provide explanations for verdicts. Existing research explores the key information in case facts to yield the court views. Most studies employ a coarse-grained approach that partitions the facts into broad segments (e.g., verdict-related sentences) to make predictions. However, this approach fails to capture the complex details present in the case facts, such as various criminal elements and legal events. To this end, in this paper, we propose an Event Grounded Generation (EGG) method for criminal court view generation with cooperative (Large) Language Models, which introduces the fine-grained event information into the generation. Specifically, we first design an LLM-based extraction method that can extract events in case facts without massive annotated events. Then, we incorporate the extracted events into court view generation by merging case facts and events. Besides, considering the computational burden posed by the use of LLMs in the extraction phase of EGG, we propose an LLM-free EGG method that can eliminate the requirement for event extraction using LLMs in the inference phase. Extensive experimental results on a real-world dataset clearly validate the effectiveness of our proposed method.
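The extract-then-merge step can be sketched as follows, with a pattern-based extractor standing in for the LLM extraction phase and a simple concatenation standing in for the merge. The patterns and the `[EVENT]` tag format are invented for illustration.

```python
# Event-grounded input construction: extract events from the case
# facts, then append them to the facts as tagged lines.
import re

EVENT_PATTERNS = [r"\bstole\b[^.]*", r"\binjured\b[^.]*", r"\bfled\b[^.]*"]

def extract_events(facts):
    """Stand-in for the LLM-based event extraction phase."""
    events = []
    for pattern in EVENT_PATTERNS:
        events += re.findall(pattern, facts.lower())
    return events

def merge_facts_and_events(facts, events):
    tagged = "\n".join(f"[EVENT] {e}" for e in events)
    return f"{facts}\n{tagged}" if events else facts

facts = "The defendant stole a motorbike and fled the scene."
print(merge_facts_and_events(facts, extract_events(facts)))
```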


From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification

Xu, Shanshan, Santosh, T. Y. S. S, Ichim, Oana, Risini, Isabella, Plank, Barbara, Grabmair, Matthias

arXiv.org Artificial Intelligence

In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well-known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset RAVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.
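The "weak agreement" observed between the two experts is the kind of quantity one would measure with a chance-corrected statistic such as Cohen's kappa. Below is a from-scratch version for two annotators with binary labels; the label vectors are invented example data, not RAVE annotations.

```python
# Cohen's kappa for two annotators over binary labels:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement).

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n   # annotator A's rate of positive labels
    p_b = sum(b) / n   # annotator B's rate of positive labels
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

ann1 = [1, 1, 0, 0, 1, 0]
ann2 = [1, 0, 0, 1, 1, 0]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.333
```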


Zero-shot Transfer of Article-aware Legal Outcome Classification for European Court of Human Rights Cases

Santosh, T. Y. S. S, Ichim, Oana, Grabmair, Matthias

arXiv.org Artificial Intelligence

Legal Judgment Prediction (LJP) has recently gained considerable attention in the mainstream NLP community (e.g., Aletras et al. 2016; Chalkidis et al. 2019, 2021, 2022b; Santosh et al. 2022, 2023). In LJP, the outcome of a case should be classified/predicted based on a textual description of case facts. In actual legal reasoning, legal practitioners (e.g., advocates, judges) determine relevant rules from the sources of law (e.g., statutes, regulations, precedent) that are relevant to the case at hand, and then carry out an analysis to determine which rules apply. Holzenberger et al. 2020 modeled statutory reasoning by classifying US tax law provisions concatenated with textual case descriptions. We build on this prior work in two ways. First, we develop and evaluate our model on a public dataset (Chalkidis et al., 2022b) of cases by the European Court of Human Rights (ECtHR), which hears complaints by individuals about possible infringements of their rights enshrined in the European Convention on Human Rights (ECHR) by states. To the best of our knowledge, this is the first work applying an article-aware case outcome prediction setting to human rights adjudication.
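The article-aware setting described above pairs each Convention article with the case facts and classifies the pair. A minimal input builder might look like the following; the `[SEP]` convention, the truncation length, and the toy classifier are assumptions about the encoder, not details fixed by the paper.

```python
# Article-aware input construction: one binary prediction per
# (article, facts) pair.

def article_aware_input(article_text, facts, max_chars=2000):
    pair = f"Article: {article_text} [SEP] Facts: {facts}"
    return pair[:max_chars]

def predict_violations(articles, facts, classify):
    """Run one binary prediction for each article paired with the facts."""
    return {name: classify(article_aware_input(text, facts))
            for name, text in articles.items()}

articles = {"Art. 3": "Prohibition of torture...",
            "Art. 6": "Right to a fair trial..."}
toy_classifier = lambda text: int("detained" in text)  # stand-in model
print(predict_violations(articles,
                         "The applicant was detained without a hearing.",
                         toy_classifier))
```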


Discovering Latent Strategies

Xu, Xiaoxi (University of Massachusetts Amherst)

AAAI Conferences

Strategy mining is a new area of research about discovering strategies in decision-making. In this paper, we formulate the strategy-mining problem as a clustering problem, called the latent-strategy problem. In a latent-strategy problem, a corpus of data instances is given, each of which is represented by a set of features and a decision label. The inherent dependency of the decision label on the features is governed by a latent strategy. The objective is to find clusters, each of which contains data instances governed by the same strategy. Existing clustering algorithms are inappropriate for clustering dependency because they either assume feature independence (e.g., K-means) or only consider the co-occurrence of features without explicitly modeling the special dependency of the decision label on other features (e.g., Latent Dirichlet Allocation (LDA)). In this paper, we present a baseline unsupervised learning algorithm for dependency clustering. Our model-based clustering algorithm iterates between an assignment step and a minimization step to learn a mixture of decision tree models that represent latent strategies. Similar to the Expectation Maximization algorithm, our algorithm is grounded in statistical learning theory. Different from other clustering algorithms, our algorithm is irrelevant-feature resistant and its learned clusters (modeled by decision trees) are strongly interpretable and predictive. We systematically evaluate our algorithm using a common law dataset comprising actual cases. Experimental results show that our algorithm significantly outperforms K-means and LDA on clustering dependency.
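The assignment/minimization loop described above can be sketched with one-feature decision stumps standing in for the decision trees: refit one stump per cluster, then reassign each instance to a stump that explains its decision label. The synthetic data, the number of clusters K, and the tie-breaking rule are toy choices for illustration.

```python
# Mixture-of-stumps sketch of dependency clustering: a minimization
# step refits one stump per cluster, an assignment step moves each
# instance to a stump that predicts its label.
import random

random.seed(0)

def fit_stump(points):
    """Return (threshold, below_label, above_label) minimizing errors."""
    best, best_err = (0.5, 0, 1), len(points) + 1
    for t, _ in points:
        for below, above in ((0, 1), (1, 0)):
            err = sum(y != (above if x > t else below) for x, y in points)
            if err < best_err:
                best, best_err = (t, below, above), err
    return best

def predict(stump, x):
    t, below, above = stump
    return above if x > t else below

# Synthetic corpus with two latent strategies: y = [x > 0.5] vs. y = [x < 0.5].
data = []
for _ in range(120):
    x, s = random.random(), random.randrange(2)
    data.append((x, int(x > 0.5) if s == 0 else int(x < 0.5), s))

K = 2
assign = [random.randrange(K) for _ in data]
for _ in range(8):
    stumps = []
    for k in range(K):                       # minimization step
        members = [(x, y) for (x, y, _), a in zip(data, assign) if a == k]
        stumps.append(fit_stump(members or [(0.5, 0)]))  # guard: empty cluster
    new_assign = []
    for (x, y, _), a in zip(data, assign):   # assignment step
        fits = [k for k in range(K) if predict(stumps[k], x) == y]
        new_assign.append(a if (not fits or a in fits) else fits[0])
    assign = new_assign
print(stumps)
```

Like EM, the loop monotonically improves how well each cluster's model fits its members; unlike K-means, each cluster is summarized by an interpretable predictor rather than a centroid.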